Dissected 3D CNNs: Temporal skip connections for efficient online video processing

Authors

Abstract

Convolutional Neural Networks with 3D kernels (3D CNNs) currently achieve state-of-the-art results in video recognition tasks due to their supremacy in extracting spatiotemporal features within the frames. There have been many successful 3D CNN architectures surpassing one another in performance. However, nearly all of them are designed to operate offline, which creates several serious handicaps during online operation. Firstly, conventional 3D CNNs are not dynamic, since their output features represent the complete input clip instead of the most recent frame of the clip. Secondly, they are not temporal resolution-preserving due to their inherent temporal downsampling. Lastly, they are constrained to a fixed temporal input size, which limits their flexibility. In order to address these drawbacks, we propose dissected 3D CNNs, where the intermediate volumes of the network are dissected and propagated over the depth (time) dimension for future calculations, substantially reducing the number of computations at online operation. For action classification, the dissected version of ResNet models performs 77%–90% fewer operations at online operation while achieving ∼5% better classification accuracy on the Kinetics-600 dataset than conventional 3D-ResNet models. Moreover, the advantages of dissected 3D CNNs are demonstrated by deploying our approach onto several vision tasks, which consistently improved their performance.
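The caching idea behind this kind of online operation can be illustrated with a small sketch: instead of re-running a 3D convolution over the whole clip for every new frame, the most recent input frames are kept in a rolling cache along the time axis, so each incoming frame costs only one time step of computation. Everything below (the class name CachedTemporalConvBlock, the cache bootstrap, the tensor shapes) is an illustrative assumption for this sketch, not the authors' released implementation.

```python
# Illustrative sketch only (assumed names and shapes, not the authors' code):
# a 3D-conv block that keeps a rolling cache of its most recent input frames,
# so streaming inference costs one time step of computation per new frame
# instead of re-processing the whole clip.
import torch
import torch.nn as nn


class CachedTemporalConvBlock(nn.Module):
    """3D convolution whose temporal receptive field is served from a cache."""

    def __init__(self, in_channels: int, out_channels: int, temporal_size: int = 3):
        super().__init__()
        self.temporal_size = temporal_size
        # No temporal padding: the output has exactly one time step per call.
        self.conv = nn.Conv3d(in_channels, out_channels,
                              kernel_size=(temporal_size, 3, 3),
                              padding=(0, 1, 1))
        self.cache = None  # holds the last `temporal_size` input frames

    def forward(self, frame: torch.Tensor) -> torch.Tensor:
        # frame: (N, C, 1, H, W) -- a single new time step.
        if self.cache is None:
            # Bootstrap by repeating the first frame along the time axis.
            self.cache = frame.repeat(1, 1, self.temporal_size, 1, 1)
        else:
            # Drop the oldest cached frame, append the newest one.
            self.cache = torch.cat([self.cache[:, :, 1:], frame], dim=2)
        return self.conv(self.cache)  # (N, out_channels, 1, H, W)


if __name__ == "__main__":
    block = CachedTemporalConvBlock(3, 16).eval()
    video = torch.randn(8, 1, 3, 112, 112)          # 8 frames: (T, N, C, H, W)
    with torch.no_grad():
        for t in range(video.shape[0]):
            out = block(video[t].unsqueeze(2))       # constant per-frame cost
    print(out.shape)                                 # torch.Size([1, 16, 1, 112, 112])
```

The paper's dissected networks go further by propagating intermediate volumes across layers; the sketch only shows the basic principle of trading recomputation for a small temporal cache.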


Related articles

Efficient Online Spatio-Temporal Filtering for Video Event Detection

We propose a novel spatio-temporal filtering technique to improve the per-pixel prediction map by leveraging the spatio-temporal smoothness of the video signal. Different from previous techniques that perform spatio-temporal filtering in an offline/batch mode, e.g., through a graphical model, our filtering can be implemented online and in real time, with provably the lowest computational complexity...
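As a generic point of comparison (not the filtering algorithm of the cited paper), online temporal smoothing of a per-pixel prediction map can be done with a recursive exponential moving average, which needs no future frames and has constant per-frame cost. The function name and the `alpha` parameter below are assumptions for illustration.

```python
# Generic illustration (not the cited paper's algorithm): online temporal
# smoothing of per-pixel prediction maps with a recursive exponential moving
# average, costing O(H*W) per frame and requiring no access to future frames.
import numpy as np


def smooth_prediction_stream(pred_maps, alpha=0.7):
    """Yield temporally smoothed per-pixel prediction maps.

    pred_maps: iterable of (H, W) arrays with per-pixel scores in [0, 1].
    alpha:     weight of the running estimate; higher = smoother, more lag.
    """
    running = None
    for pred in pred_maps:
        pred = np.asarray(pred, dtype=np.float64)
        if running is None:
            running = pred.copy()
        else:
            running = alpha * running + (1.0 - alpha) * pred
        yield running


if __name__ == "__main__":
    # Toy stream: noisy per-pixel scores for a 4x4 "video" of 10 frames.
    rng = np.random.default_rng(0)
    stream = (np.clip(0.5 + 0.3 * rng.standard_normal((4, 4)), 0, 1) for _ in range(10))
    for t, smoothed in enumerate(smooth_prediction_stream(stream)):
        print(t, float(smoothed.mean()))
```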


Efficient Two-Stream Motion and Appearance 3D CNNs for Video Classification

Video and action classification have evolved considerably with deep neural networks, especially two-stream CNNs that use RGB and optical flow as inputs and show outstanding performance in video analysis. One shortcoming of these methods is the handling of motion information extraction, which is done outside of the CNNs and is relatively time consuming, even on GPUs. So proposing endt...


3D-List: A Data Structure for Efficient Video Query Processing

Chih-Chin Liu and Arbee L. P. Chen, Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan 300, R.O.C. Email: [email protected]. Abstract: In this paper, a video query model based on the content of video and iconic indexing is proposed. We extend the notion of two-dimensional strings to three-dimensional strings (3D-Strings) for representing the spatial and temporal rel...


Self-Supervised Visual Planning with Temporal Skip Connections

In order to autonomously learn wide repertoires of complex skills, robots must be able to learn from their own autonomously collected data, without human supervision. One learning signal that is always available for autonomously collected data is prediction. If a robot can learn to predict the future, it can use this predictive model to take actions to produce desired outcomes, such as moving a...


Efficient spatio-temporal decomposition for perceptual processing of video sequences

This paper presents a vision model for moving pictures. The model is an extension of a normalization model by Teo and Heeger. It accounts for normalization of the cortical receptive field responses and inter-channel masking. The model is compared with a simpler vision model for video by presenting results on quality assessment of MPEG compressed video.



Journal

Journal title: Computer Vision and Image Understanding

Year: 2022

ISSN: 1090-235X, 1077-3142

DOI: https://doi.org/10.1016/j.cviu.2021.103318